Intro to Data Analysis

Introduction to Data Analysis

  • UNIT OF ANALYSIS
  • POPULATION
  • SAMPLE
  • N & n
  • DESCRIPTIVE STATISTICS
  • INFERENTIAL STATISTICS
  • TIDY DATA
  • VARIABLES
  • NOMINAL
  • ORDINAL
  • INTERVAL-RATIO
  • DICHOTOMOUS

UNIT OF ANALYSIS

Who or what is being studied?


POPULATION

All units of analysis (people, institutions, groups, etc.) in which the researcher is interested.


SAMPLE

A subset of people (or institutions, groups, etc.) selected from a population.

DESCRIPTIVE STATISTICS

Procedures that help us organize and describe data collected from a sample or population.


INFERENTIAL STATISTICS

Procedures for making predictions or inferences about a population using observations and analyses from a sample.
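The relationship between a population, a sample, and the two kinds of statistics can be sketched in R. This is a toy illustration only: the vector `population_ages`, the sizes 10,000 and 500, and the seed are made up for the example. It follows the usual convention that N is the number of cases in the population and n is the number of cases in the sample.

```r
# Toy illustration only: simulated ages, not real survey data.
set.seed(42)

# Hypothetical population: every unit of analysis we are interested in
population_ages <- sample(18:89, size = 10000, replace = TRUE)
N <- length(population_ages)   # population size

# Sample: a randomly selected subset of the population
sample_ages <- sample(population_ages, size = 500)
n <- length(sample_ages)       # sample size

# Descriptive statistics: organize and describe the data we collected
sample_mean <- mean(sample_ages)
sample_range <- range(sample_ages)

# Inferential statistics: use the sample mean as an estimate of the
# population mean. We can check the estimate here only because the
# population is simulated; in real research it is usually unknown.
population_mean <- mean(population_ages)

cat("N =", N, " n =", n, "\n")
cat("Sample mean:", round(sample_mean, 1), "\n")
cat("Population mean:", round(population_mean, 1), "\n")
```

In practice the population mean is not available, which is why the sample has to stand in for it.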

Tidy Data

VARIABLES

Any factor, trait, or condition that can exist in differing amounts or types.

Measurement Levels

Nominal
A variable made up of categories that cannot be ordered.

Ordinal
A variable made up of ranked categories, with no systematic or measurable numeric difference between the categories.

Interval-ratio (aka continuous)
A variable whose values are ordered and expressed in the same units, so the numeric difference between any two values is meaningful.

Dichotomous (aka binary)
A variable with only two categories.
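One way to see the four measurement levels is through the R data types that usually hold them. The sketch below uses five made-up respondents; the variable names (`region`, `attitude`, `age`, `employed`) and their values are hypothetical and chosen only for illustration.

```r
# Hypothetical responses for five people (made up for illustration)

# Nominal: categories with no inherent order -> unordered factor
region <- factor(c("south", "west", "south", "northeast", "midwest"))

# Ordinal: ranked categories -> ordered factor with explicit level order
attitude <- factor(
  c("always wrong", "not wrong at all", "wrong only sometimes",
    "not wrong at all", "almost always wrong"),
  levels = c("always wrong", "almost always wrong",
             "wrong only sometimes", "not wrong at all"),
  ordered = TRUE
)

# Interval-ratio: values in the same units -> numeric vector
age <- c(23, 41, 35, 67, 52)

# Dichotomous: exactly two categories -> factor with two levels
employed <- factor(c("yes", "no", "yes", "yes", "no"))

# Rank comparisons are defined for ordered factors only
attitude[1] < attitude[2]   # TRUE: "always wrong" ranks below "not wrong at all"

# Arithmetic such as a mean is meaningful for interval-ratio variables
mean(age)                   # 43.6

# A dichotomous variable has exactly two levels
nlevels(employed)           # 2
```

The measurement level matters because it determines which summaries make sense: counts for any level, rank comparisons for ordinal and above, and arithmetic for interval-ratio.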

Frequency Distributions

  • FREQUENCY DISTRIBUTION
  • RELATIVE FREQUENCY DISTRIBUTION
  • PROPORTION
  • PERCENTAGE
  • CUMULATIVE
  • RATE
  • BAR GRAPH
  • HISTOGRAM
  • LINE GRAPH
  • STATISTICAL MAP

DISTRIBUTION

Shows all the possible values (or intervals) of the data and how often they occur.


FREQUENCY DISTRIBUTION

A table reporting the number of observations falling into each category of the variable.

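# Packages used in this section. gss_all is assumed to be loaded already;
# style_flextable(), used when rendering, is not part of flextable and is
# assumed to be a helper defined elsewhere.
library(dplyr)
library(haven)
library(flextable)

# Convert special missing values to NA, then turn labelled columns into factors
gss_all$premarsx <- as_factor(zap_missing(gss_all$premarsx))
gss_all$sex <- as_factor(zap_missing(gss_all$sex))

# Count 2024 respondents in each premarsx category
freq_premarsx <- gss_all %>%
  select(id, year, sex, premarsx) %>%
  filter(year == 2024, !is.na(premarsx)) %>%
  count(premarsx)

# Build a totals row by summing the numeric column (n)
total_row <- freq_premarsx %>%
  summarise(across(where(is.numeric), sum)) %>%
  mutate(premarsx = "Total")

# Combine the category rows with the totals row
table_premarsx <- rbind(freq_premarsx, total_row)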

Table 1. Attitudes about sex before marriage

# Render the table
table_premarsx %>%
  flextable() %>%
  style_flextable()

premarsx                    n
always wrong              357
almost always wrong       122
wrong only sometimes      258
not wrong at all        1,378
Total                   2,115

Survey question: There’s been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and woman have sex relations before marriage, do you think it is _________.

Table 1. Attitudes about sex before marriage

table_premarsx %>%
  flextable() %>%
  style_flextable() %>%
  color(color = "#E74C3C", i = 5, j = "n")

premarsx                    n
always wrong              357
almost always wrong       122
wrong only sometimes      258
not wrong at all        1,378
Total                   2,115

Highlighted value (2,115): the number of respondents who answered this survey question.

Table 1. Attitudes about sex before marriage

table_premarsx %>%
  flextable() %>%
  style_flextable() %>%
  color(color = "#E74C3C", i = 3, j = "n")

premarsx                    n
always wrong              357
almost always wrong       122
wrong only sometimes      258
not wrong at all        1,378
Total                   2,115

Highlighted value (258): the number of respondents who said premarital sex was “wrong only sometimes.”

Source: U.S. General Social Survey 2024

RELATIVE FREQUENCY DISTRIBUTION

A table showing the proportion or percentage for each value of a variable.


Proportions are between 0 and 1.0.

Proportion = count (f) / total number of cases (N).


Percentages are between 0 and 100.

Percentage = proportion × 100.
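The two formulas above can be applied directly to the counts in Table 1. This is a minimal base-R sketch; the counts are copied from that table, while the object names (`counts`, `rel_freq`) are arbitrary choices for the example.

```r
# Counts (f) for each category, taken from Table 1
counts <- c("always wrong"         = 357,
            "almost always wrong"  = 122,
            "wrong only sometimes" = 258,
            "not wrong at all"     = 1378)

# Total number of cases (N)
N <- sum(counts)

# Proportion = f / N, always between 0 and 1
proportion <- counts / N

# Percentage = proportion * 100, always between 0 and 100
percentage <- proportion * 100

# Assemble a relative frequency distribution
rel_freq <- data.frame(
  premarsx   = names(counts),
  f          = unname(counts),
  proportion = unname(round(proportion, 3)),
  percentage = unname(round(percentage, 1)),
  row.names  = NULL
)

print(rel_freq)
```

Before rounding, the proportions sum to 1 and the percentages sum to 100; rounded values may be off by a small amount.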

CUMULATIVE FREQUENCY DISTRIBUTION

The number or percentage of observations at or below a given category.


Table 3. Attitudes about sex before marriage, with cumulative percentages

# Drop factor levels that are no longer used
gss_all$premarsx <- droplevels(gss_all$premarsx)

# Create frequency and percentage table
tab <- gss_all %>%
  filter(year == 2024, !is.na(premarsx)) %>%
  group_by(premarsx) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(percent = round(100 * n / sum(n), 0),
         # accumulate the counts first, then round, so that
         # rounding error does not build up across categories
         cum_percent = round(100 * cumsum(n) / sum(n), 0))

# Add totals row. Only n and percent are summed: adding up
# cumulative percentages does not produce a meaningful number.
tab_totals <- tab %>%
  summarise(across(c(n, percent), sum)) %>%
  mutate(premarsx = "Total")

# Combine with original table (cum_percent is NA in the totals row)
tab_with_totals <- bind_rows(tab, tab_totals)

# Pretty table
tab_with_totals %>%
  flextable() %>%
  style_flextable() %>%
  set_header_labels(
    n = "n",
    percent = "%",
    cum_percent = "cumulative %") %>%
  color(color = "#18bc9c", i = 1, j = 3) %>%
  color(color = "#fd7e14", i = 2, j = 3) %>%
  color(color = "#e74c3c", i = 2, j = 4)

premarsx                    n      %   cumulative %
always wrong              357     17             17
almost always wrong       122      6             23
wrong only sometimes      258     12             35
not wrong at all        1,378     65            100
Total                   2,115    100

Cumulative percentage for “almost always wrong”: \(17\% + 6\% = 23\%\)
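The same running total can be checked for every category in base R. This is a sketch using the counts from the table, not the code that produced it; it assumes the categories are listed in their ranked order, since cumulative figures are only meaningful when the categories can be ordered.

```r
# Counts from the table, listed in ranked order
counts <- c("always wrong"         = 357,
            "almost always wrong"  = 122,
            "wrong only sometimes" = 258,
            "not wrong at all"     = 1378)

# Cumulative frequency: observations at or below each category
cum_count <- cumsum(counts)

# Cumulative percentage: cumulative frequency as a share of all cases
cum_percent <- 100 * cum_count / sum(counts)

# Rounded to whole percentages, as in the table
round(cum_percent)
```

The last category always reaches 100%, because every observation falls at or below it.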

Lab 01

  • CODEBOOK

will add later